Breast cancer is a disease characterized by the uncontrolled growth of abnormal cells in breast tissue, forming tumors that, if not detected and treated promptly, can metastasize throughout the body with fatal consequences. The cancer arises in the lining (epithelial) cells of the ducts (85%) or lobules (15%) of the glandular tissue of the breast. The earliest form of the disease, carcinoma in situ, remains confined to its site of origin, is not life-threatening, and can often be detected at an early stage. Once cancer cells invade the surrounding breast tissue, however, they can form lumps and thickening and may spread to nearby lymph nodes or other organs, leading to more severe and life-threatening disease.
In 2022, there were 2.3 million new cases of breast cancer diagnosed worldwide, and approximately 670,000 deaths. The disease affects women of all ages after puberty, with increasing rates in later life. Global estimates reveal stark disparities in breast cancer incidence and mortality based on human development indices.
Female sex is the primary risk factor for breast cancer: approximately 99% of cases occur in women and only 0.5–1% in men. Other risk factors include increasing age, obesity, excessive alcohol consumption, a family history of breast cancer, a history of radiation exposure, reproductive history, tobacco use, and postmenopausal hormone therapy.
In the early stages, breast cancer may not present any noticeable symptoms, highlighting the importance of early detection. As the disease progresses, symptoms may include a breast lump or thickening, often without pain; changes in size, shape, or appearance of the breast; dimpling, redness, or other skin changes; alterations in the nipple or areola appearance; and abnormal or bloody discharge from the nipple.
This project seeks to address the question of improving breast cancer diagnostics by exploring how machine learning techniques can be applied to enhance the diagnostic process. Using image processing and manual measurements of cell characteristics from Fine Needle Aspiration (FNA) images, it aims to predict the probability that a diagnosed breast cancer case is malignant or benign. By leveraging machine learning algorithms, the project intends to create a tool that can assist healthcare professionals in making more accurate and timely diagnoses, ultimately contributing to better patient outcomes and advancing the field of cancer research. To achieve this, the project will utilize five classification models: Logistic Regression, K Nearest Neighbors (KNN), Random Forests, Support Vector Machines (SVM) with the RBF kernel, and Gradient Boosting.
This dataset contains features computed from digitized images of fine needle aspiration (FNA) samples of breast masses; the features describe characteristics of the cell nuclei present in each image. The dataset is based on a study by K. P. Bennett and O. L. Mangasarian.
Attribute Information:
Columns 3–32: ten real-valued features are computed for each cell nucleus — radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry, and fractal dimension — each reported as three measurements (mean, standard error, and worst/largest value), giving 30 feature columns in total.
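The 30 feature columns therefore follow a base-name-plus-statistic naming pattern, and the full column list can be generated directly. A small sketch (the generated names match the dataset's feature columns, leaving aside `id` and `diagnosis`):

```python
# Ten base nucleus measurements, each reported as mean, standard error, and worst value
base_features = ["radius", "texture", "perimeter", "area", "smoothness",
                 "compactness", "concavity", "concave points", "symmetry",
                 "fractal_dimension"]
stats = ["mean", "se", "worst"]

# Mean block first, then standard errors, then worst values — matching the CSV's column order
feature_columns = [f"{base}_{stat}" for stat in stats for base in base_features]
print(len(feature_columns))  # 30
```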
Exploratory Data Analysis (EDA) is a technique used to analyze datasets to extract characteristic information and obtain a comprehensive overview of the data's features. This process helps in discovering patterns, spotting anomalies, and testing hypotheses using basic statistical exploration and visualization tools.
# Loading libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
%matplotlib inline
# Reading data into Data Frame
df = pd.read_csv('data.csv')
# Display the first 5 rows
df.head()
| | id | diagnosis | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | ... | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst | Unnamed: 32 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 842302 | M | 17.99 | 10.38 | 122.80 | 1001.0 | 0.11840 | 0.27760 | 0.3001 | 0.14710 | ... | 17.33 | 184.60 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.11890 | NaN |
| 1 | 842517 | M | 20.57 | 17.77 | 132.90 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | ... | 23.41 | 158.80 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.1860 | 0.2750 | 0.08902 | NaN |
| 2 | 84300903 | M | 19.69 | 21.25 | 130.00 | 1203.0 | 0.10960 | 0.15990 | 0.1974 | 0.12790 | ... | 25.53 | 152.50 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.2430 | 0.3613 | 0.08758 | NaN |
| 3 | 84348301 | M | 11.42 | 20.38 | 77.58 | 386.1 | 0.14250 | 0.28390 | 0.2414 | 0.10520 | ... | 26.50 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.17300 | NaN |
| 4 | 84358402 | M | 20.29 | 14.34 | 135.10 | 1297.0 | 0.10030 | 0.13280 | 0.1980 | 0.10430 | ... | 16.67 | 152.20 | 1575.0 | 0.1374 | 0.2050 | 0.4000 | 0.1625 | 0.2364 | 0.07678 | NaN |
5 rows × 33 columns
# Display the last 5 rows
df.tail()
| | id | diagnosis | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | ... | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst | Unnamed: 32 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 564 | 926424 | M | 21.56 | 22.39 | 142.00 | 1479.0 | 0.11100 | 0.11590 | 0.24390 | 0.13890 | ... | 26.40 | 166.10 | 2027.0 | 0.14100 | 0.21130 | 0.4107 | 0.2216 | 0.2060 | 0.07115 | NaN |
| 565 | 926682 | M | 20.13 | 28.25 | 131.20 | 1261.0 | 0.09780 | 0.10340 | 0.14400 | 0.09791 | ... | 38.25 | 155.00 | 1731.0 | 0.11660 | 0.19220 | 0.3215 | 0.1628 | 0.2572 | 0.06637 | NaN |
| 566 | 926954 | M | 16.60 | 28.08 | 108.30 | 858.1 | 0.08455 | 0.10230 | 0.09251 | 0.05302 | ... | 34.12 | 126.70 | 1124.0 | 0.11390 | 0.30940 | 0.3403 | 0.1418 | 0.2218 | 0.07820 | NaN |
| 567 | 927241 | M | 20.60 | 29.33 | 140.10 | 1265.0 | 0.11780 | 0.27700 | 0.35140 | 0.15200 | ... | 39.42 | 184.60 | 1821.0 | 0.16500 | 0.86810 | 0.9387 | 0.2650 | 0.4087 | 0.12400 | NaN |
| 568 | 92751 | B | 7.76 | 24.54 | 47.92 | 181.0 | 0.05263 | 0.04362 | 0.00000 | 0.00000 | ... | 30.37 | 59.16 | 268.6 | 0.08996 | 0.06444 | 0.0000 | 0.0000 | 0.2871 | 0.07039 | NaN |
5 rows × 33 columns
# Check the shape of the dataset
df.shape
(569, 33)
# Display column names
df.columns
Index(['id', 'diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean',
'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean',
'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
'fractal_dimension_se', 'radius_worst', 'texture_worst',
'perimeter_worst', 'area_worst', 'smoothness_worst',
'compactness_worst', 'concavity_worst', 'concave points_worst',
'symmetry_worst', 'fractal_dimension_worst', 'Unnamed: 32'],
dtype='object')
# Get data types of each column
df.dtypes
id                           int64
diagnosis                   object
radius_mean                float64
...                            ...
fractal_dimension_worst    float64
Unnamed: 32                float64
dtype: object
# Checking the distribution of categorical variables
categorical_columns = df.select_dtypes(include=['object']).columns
for column in categorical_columns:
print(f'\nDistribution of {column}:')
print(df[column].value_counts())
Distribution of diagnosis:
diagnosis
B    357
M    212
Name: count, dtype: int64
#Checking for Duplicate rows
df.duplicated().sum()
0
# Check for missing values
df.isnull().sum()
id                           0
diagnosis                    0
radius_mean                  0
...                        ...
fractal_dimension_worst      0
Unnamed: 32                569
dtype: int64
The entire "Unnamed: 32" column consists of NaN values (all 569 rows), so it carries no information and can be dropped.
# dropping 'Unnamed: 32' column.
df.drop("Unnamed: 32", axis=1, inplace=True)
# Dropping the 'id' column (an identifier that won't be informative for this analysis)
df.drop('id', axis=1, inplace=True)
# descriptive statistics of data
df.describe()
| | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | symmetry_mean | fractal_dimension_mean | ... | radius_worst | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | ... | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 |
| mean | 14.127292 | 19.289649 | 91.969033 | 654.889104 | 0.096360 | 0.104341 | 0.088799 | 0.048919 | 0.181162 | 0.062798 | ... | 16.269190 | 25.677223 | 107.261213 | 880.583128 | 0.132369 | 0.254265 | 0.272188 | 0.114606 | 0.290076 | 0.083946 |
| std | 3.524049 | 4.301036 | 24.298981 | 351.914129 | 0.014064 | 0.052813 | 0.079720 | 0.038803 | 0.027414 | 0.007060 | ... | 4.833242 | 6.146258 | 33.602542 | 569.356993 | 0.022832 | 0.157336 | 0.208624 | 0.065732 | 0.061867 | 0.018061 |
| min | 6.981000 | 9.710000 | 43.790000 | 143.500000 | 0.052630 | 0.019380 | 0.000000 | 0.000000 | 0.106000 | 0.049960 | ... | 7.930000 | 12.020000 | 50.410000 | 185.200000 | 0.071170 | 0.027290 | 0.000000 | 0.000000 | 0.156500 | 0.055040 |
| 25% | 11.700000 | 16.170000 | 75.170000 | 420.300000 | 0.086370 | 0.064920 | 0.029560 | 0.020310 | 0.161900 | 0.057700 | ... | 13.010000 | 21.080000 | 84.110000 | 515.300000 | 0.116600 | 0.147200 | 0.114500 | 0.064930 | 0.250400 | 0.071460 |
| 50% | 13.370000 | 18.840000 | 86.240000 | 551.100000 | 0.095870 | 0.092630 | 0.061540 | 0.033500 | 0.179200 | 0.061540 | ... | 14.970000 | 25.410000 | 97.660000 | 686.500000 | 0.131300 | 0.211900 | 0.226700 | 0.099930 | 0.282200 | 0.080040 |
| 75% | 15.780000 | 21.800000 | 104.100000 | 782.700000 | 0.105300 | 0.130400 | 0.130700 | 0.074000 | 0.195700 | 0.066120 | ... | 18.790000 | 29.720000 | 125.400000 | 1084.000000 | 0.146000 | 0.339100 | 0.382900 | 0.161400 | 0.317900 | 0.092080 |
| max | 28.110000 | 39.280000 | 188.500000 | 2501.000000 | 0.163400 | 0.345400 | 0.426800 | 0.201200 | 0.304000 | 0.097440 | ... | 36.040000 | 49.540000 | 251.200000 | 4254.000000 | 0.222600 | 1.058000 | 1.252000 | 0.291000 | 0.663800 | 0.207500 |
8 rows × 30 columns
# Plot histograms for all numerical columns
df.hist(bins=20, figsize=(20, 15))
plt.show()
Several features, such as radius_mean, perimeter_mean, and area_mean, exhibit significant right skewness, indicating the presence of outliers with large values. Conversely, features like texture_mean, smoothness_mean, symmetry_mean, and fractal_dimension_mean display distributions that are more symmetric and closer to normal.
The standard error features (_se) also show right-skewed distributions, suggesting lower values for most instances with a few higher outliers. The "worst" features (_worst) present wide distributions with varying degrees of skewness, reflecting the largest measurements.
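These skewness observations can be quantified with `pandas.DataFrame.skew`. A sketch, using scikit-learn's bundled copy of the same WDBC data (note its column names differ slightly from the CSV's — e.g. `mean radius` rather than `radius_mean`):

```python
from sklearn.datasets import load_breast_cancer

# Load the WDBC data as a DataFrame and keep only the 30 feature columns
X = load_breast_cancer(as_frame=True).frame.drop(columns="target")

# Per-feature skewness: large positive values confirm the right-skewed features
skews = X.skew().sort_values(ascending=False)
print(skews.head())   # most right-skewed (e.g. the area/concavity standard errors)
print(skews.tail())   # most symmetric features
```

Features with skewness near zero correspond to the roughly normal distributions noted above; values well above 1 flag the long right tails.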
#Create Countplot of Diagnosis
ax = sns.countplot(x='diagnosis', data=df, palette=['#FF9999','#66B2FF'])
# Set title
plt.title('Countplot of Diagnosis')
# Create custom legend
from matplotlib.patches import Patch
legend_labels = ['Malignant', 'Benign']
legend_colors = ['#FF9999','#66B2FF']
handles = [Patch(color=color, label=label) for color, label in zip(legend_colors, legend_labels)]
plt.legend(handles=handles, title='Diagnosis')
# Show the plot
plt.show()
The countplot displays the distribution of breast cancer diagnoses in the dataset, highlighting that there are more benign cases (357, shown in blue) than malignant cases (212, shown in red). The plot provides a clear visual comparison between the two categories, indicating that benign diagnoses are more prevalent in this dataset.
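This class imbalance also sets the bar for any classifier: a trivial model that always predicts "benign" would already be correct on 357 of 569 cases, so a useful model must beat that majority-class baseline.

```python
# Majority-class baseline from the counts shown in the countplot
benign, malignant = 357, 212
baseline = benign / (benign + malignant)
print(f"Majority-class baseline accuracy: {baseline:.4f}")  # -> 0.6274
```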
# Calculate the correlation matrix
df_numeric = df.drop(columns=['diagnosis'])
correlation_matrix = df_numeric.corr()
# Plot heatmap for the correlation matrix
plt.figure(figsize=(20, 18))
sns.heatmap(correlation_matrix, annot=True, linewidths=.5, cmap='coolwarm', center=0)
plt.show()
The heatmap shows the correlation matrix for the features in the breast cancer dataset. High positive correlations (closer to 1) are highlighted in dark red, while negative correlations (closer to -1) are in blue. It reveals strong correlations among certain features, such as 'radius_mean', 'perimeter_mean', and 'area_mean', indicating that they tend to increase together. This visualization helps in understanding the relationships between different features, which is crucial for feature selection and model building.
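The strongest relationships can also be read off programmatically rather than visually: mask the upper triangle of the correlation matrix and stack it into a sorted series of feature pairs. A sketch using scikit-learn's bundled copy of the WDBC data (whose column names, e.g. `mean radius`, differ slightly from the CSV's):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

X = load_breast_cancer(as_frame=True).frame.drop(columns="target")
corr = X.corr().abs()

# Keep only the upper triangle so each feature pair appears exactly once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack().sort_values(ascending=False)
print(pairs.head(3))  # e.g. radius/perimeter correlations near 1
```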
# Distribution (KDE) plots for all numerical features
num_features = len(df.columns) - 1  # every column except 'diagnosis'
num_cols = 3
num_rows = (num_features + num_cols - 1) // num_cols
fig, axes = plt.subplots(num_rows, num_cols, figsize=(15, num_rows * 5))
fig.tight_layout(pad=3.0)
axes = axes.flatten()
for i, column in enumerate(df.columns[1:]):
    sns.kdeplot(df[column], ax=axes[i], color='orange', fill=True, linewidth=2)  # filled orange KDE
    axes[i].set_title(f'Distribution of {column}')
# Remove any unused subplots
for j in range(i + 1, len(axes)):
fig.delaxes(axes[j])
plt.show()
Most features, such as radius_mean, perimeter_mean, and area_mean, exhibit right-skewed distributions, indicating a concentration of lower values with fewer higher values. Features like texture_mean and smoothness_mean show more symmetric distributions, suggesting a more uniform spread. The standard error features are tightly clustered, indicating low variability within cell measurements. The 'worst' case features, such as radius_worst and texture_worst, highlight the extreme values in the dataset.
import warnings
warnings.filterwarnings('ignore', category=FutureWarning)
mean_of_measurements= ['diagnosis','radius_mean' , 'perimeter_mean' , 'area_mean' , 'concavity_mean' , 'concave points_mean']
custom_palette = sns.color_palette("colorblind", 2)
sns.pairplot(df[mean_of_measurements], hue='diagnosis', palette=custom_palette)
mean_of_characteristics = ['diagnosis', 'texture_mean', 'smoothness_mean', 'compactness_mean', 'symmetry_mean', 'fractal_dimension_mean']
custom_palette = sns.color_palette("colorblind", 2)
sns.pairplot(df[mean_of_characteristics], hue='diagnosis', palette=custom_palette)
Standard_Error_of_Measurements= ['diagnosis','radius_se', 'perimeter_se', 'area_se', 'concavity_se', 'concave points_se']
custom_palette = sns.color_palette("colorblind", 2)
sns.pairplot(df[Standard_Error_of_Measurements], hue='diagnosis', palette=custom_palette)
Standard_Error_of_characteristics = ['diagnosis', 'texture_se', 'smoothness_se', 'compactness_se', 'symmetry_se', 'fractal_dimension_se']
custom_palette = sns.color_palette("colorblind", 2)
sns.pairplot(df[Standard_Error_of_characteristics], hue='diagnosis', palette=custom_palette)
Worst_of_Measurements= ['diagnosis','radius_worst', 'perimeter_worst', 'area_worst', 'concavity_worst', 'concave points_worst']
custom_palette = sns.color_palette("colorblind", 2)
sns.pairplot(df[Worst_of_Measurements], hue='diagnosis', palette=custom_palette)
Worst_of_Characteristics= ['diagnosis','texture_worst', 'smoothness_worst', 'compactness_worst', 'symmetry_worst', 'fractal_dimension_worst']
custom_palette = sns.color_palette("colorblind", 2)
sns.pairplot(df[Worst_of_Characteristics], hue='diagnosis', palette=custom_palette)
# Convert 'diagnosis' to numeric
df['diagnosis'] = df['diagnosis'].map({'M': 1, 'B': 0})
# Count the occurrences of each diagnosis type
diagnosis_counts = df['diagnosis'].value_counts()
print("\nCounts of each diagnosis type:")
print(diagnosis_counts)
Counts of each diagnosis type:
diagnosis
0    357
1    212
Name: count, dtype: int64
# Selecting features based on a correlation threshold (listed here for reference; all 30 features are retained for the models below)
correlation_threshold = 0.75
correlation_matrix = df.corr()
upper_tri = correlation_matrix.where(np.triu(np.ones(correlation_matrix.shape), k=1).astype(bool))
high_correlation_features = [column for column in upper_tri.columns if any(upper_tri[column] > correlation_threshold)]
print("Highly correlated features:")
print(high_correlation_features)
Highly correlated features: ['perimeter_mean', 'area_mean', 'concavity_mean', 'concave points_mean', 'perimeter_se', 'area_se', 'concavity_se', 'concave points_se', 'fractal_dimension_se', 'radius_worst', 'texture_worst', 'perimeter_worst', 'area_worst', 'smoothness_worst', 'compactness_worst', 'concavity_worst', 'concave points_worst', 'fractal_dimension_worst']
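The list above identifies candidates for removal, although the modeling below keeps all 30 features. As a sketch of how the same threshold logic could actually prune the feature set (using scikit-learn's bundled copy of the WDBC data, so the exact drop list may differ slightly from the one printed above):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

X = load_breast_cancer(as_frame=True).frame.drop(columns="target")

# Same upper-triangle trick as above: flag any feature correlated > 0.75
# with an earlier feature, then drop the flagged columns
corr = X.corr()
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
to_drop = [c for c in upper.columns if (upper[c] > 0.75).any()]
X_reduced = X.drop(columns=to_drop)
print(f"{X.shape[1]} features -> {X_reduced.shape[1]} after pruning")
```

Dropping one member of each highly correlated pair reduces redundancy, which mainly helps models that are sensitive to multicollinearity (such as logistic regression).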
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
df.drop('diagnosis', axis=1),
df['diagnosis'],
test_size=0.2,
random_state=42
)
# Print the shapes of the resulting datasets
print("Shape of training set (features):", X_train.shape)
print("Shape of test set (features):", X_test.shape)
print("Shape of training set (target):", y_train.shape)
print("Shape of test set (target):", y_test.shape)
Shape of training set (features): (455, 30)
Shape of test set (features): (114, 30)
Shape of training set (target): (455,)
Shape of test set (target): (114,)
# Initialize the StandardScaler
scaler = StandardScaler()
# Fit the scaler on the training data and transform the training data
X_train_scaled = scaler.fit_transform(X_train)
# Transform the test data using the same scaler
X_test_scaled = scaler.transform(X_test)
print("Scaled Training Data:")
print(X_train_scaled)
print("\nScaled Test Data:")
print(X_test_scaled)
Scaled Training Data:
[[-1.44075296 -0.43531947 -1.36208497 ...  0.9320124   2.09724217  1.88645014]
 [ 1.97409619  1.73302577  2.09167167 ...  2.6989469   1.89116053  2.49783848]
 [-1.39998202 -1.24962228 -1.34520926 ... -0.97023893  0.59760192  0.0578942 ]
 ...
 [ 0.04880192 -0.55500086 -0.06512547 ... -1.23903365 -0.70863864 -1.27145475]
 [-0.03896885  0.10207345 -0.03137406 ...  1.05001236  0.43432185  1.21336207]
 [-0.54860557  0.31327591 -0.60350155 ... -0.61102866 -0.3345212  -0.84628745]]

Scaled Test Data:
[[-0.46649743 -0.13728933 -0.44421138 ... -0.19435087  0.17275669  0.20372995]
 [ 1.36536344  0.49866473  1.30551088 ...  0.99177862 -0.561211   -1.00838949]
 [ 0.38006578  0.06921974  0.40410139 ...  0.57035018 -0.10783139 -0.20629287]
 ...
 [-0.73547237 -0.99852603 -0.74138839 ... -0.27741059 -0.3820785  -0.32408328]
 [ 0.02898271  2.0334026   0.0274851  ... -0.49027026 -1.60905688 -0.33137507]
 [ 1.87216885  2.80077153  1.80354992 ...  0.7925579  -0.05868885 -0.09467243]]
from sklearn.neighbors import KNeighborsClassifier
# List to store error rates
error_rate = []
# Iterate over possible values for n_neighbors
for i in range(1, 42):
knn = KNeighborsClassifier(n_neighbors=i)
knn.fit(X_train_scaled, y_train) # Use scaled data
pred_i = knn.predict(X_test_scaled) # Use scaled data
error_rate.append(np.mean(pred_i != y_test))
# Print or analyze the error rates to find the optimal number of neighbors
print("Error Rates for different values of k:")
print(error_rate)
Error Rates for different values of k: [0.06140350877192982, 0.05263157894736842, 0.05263157894736842, 0.043859649122807015, 0.05263157894736842, 0.043859649122807015, 0.05263157894736842, 0.043859649122807015, 0.03508771929824561, 0.043859649122807015, 0.043859649122807015, 0.043859649122807015, 0.043859649122807015, 0.043859649122807015, 0.043859649122807015, 0.05263157894736842, 0.05263157894736842, 0.05263157894736842, 0.05263157894736842, 0.05263157894736842, 0.05263157894736842, 0.05263157894736842, 0.05263157894736842, 0.05263157894736842, 0.05263157894736842, 0.05263157894736842, 0.05263157894736842, 0.043859649122807015, 0.043859649122807015, 0.043859649122807015, 0.043859649122807015, 0.043859649122807015, 0.043859649122807015, 0.043859649122807015, 0.043859649122807015, 0.043859649122807015, 0.043859649122807015, 0.043859649122807015, 0.043859649122807015, 0.043859649122807015, 0.043859649122807015]
plt.figure(figsize=(12, 6))
plt.plot(range(1, 42), error_rate, color='red', linestyle='--',
marker='o', markersize=8, markerfacecolor='b')
plt.title('Error Rate vs. Number of Neighbors (k)')
plt.xlabel('Number of Neighbors (k)')
plt.ylabel('Error Rate')
plt.grid(True)
plt.show()
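Reading the elbow plot by eye works, but `np.argmin` can pick the best k programmatically. A self-contained sketch using scikit-learn's bundled copy of the WDBC data (its row order and target encoding differ from the CSV, so the k chosen here need not exactly match the k = 9 used below):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_tr = scaler.fit_transform(X_tr)
X_te = scaler.transform(X_te)

# Same sweep as above: test-set error rate for k = 1..41
errors = []
for k in range(1, 42):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    errors.append(np.mean(knn.predict(X_te) != y_te))

best_k = int(np.argmin(errors)) + 1  # argmin returns the first index of the minimum
print(best_k, min(errors))
```

When several k values tie at the minimum error, `argmin` returns the smallest; picking among ties by other criteria (e.g. preferring odd k to avoid voting ties) is a reasonable refinement.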
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
# Initialize the K-Nearest Neighbors classifier with the optimal k value
optimal_k = 9
knn = KNeighborsClassifier(n_neighbors=optimal_k)
# Train the model with the scaled training data
knn.fit(X_train_scaled, y_train)
# Make predictions with the scaled test data
y_pred = knn.predict(X_test_scaled)
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)
# Print evaluation metrics
print("Accuracy Score:", accuracy)
print("\nConfusion Matrix:")
print(conf_matrix)
print("\nClassification Report:")
print(class_report)
Accuracy Score: 0.9649122807017544
Confusion Matrix:
[[69 2]
[ 2 41]]
Classification Report:
precision recall f1-score support
0 0.97 0.97 0.97 71
1 0.95 0.95 0.95 43
accuracy 0.96 114
macro avg 0.96 0.96 0.96 114
weighted avg 0.96 0.96 0.96 114
# Plotting the confusion matrix heatmap
plt.figure(figsize=(8, 6))
conf_matrix_knn = confusion_matrix(y_test, y_pred)
sns.heatmap(conf_matrix_knn, annot=True, fmt='d', cmap='coolwarm',
xticklabels=['Class 0', 'Class 1'],
yticklabels=['Class 0', 'Class 1'])
plt.title('Confusion Matrix Heatmap (K-Nearest Neighbors)')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.show()
from sklearn.metrics import roc_curve, auc
# Compute ROC curve and AUC for KNN
fpr_knn, tpr_knn, _ = roc_curve(y_test, knn.predict_proba(X_test_scaled)[:, 1])
roc_auc_knn = auc(fpr_knn, tpr_knn)
# Plot ROC curve for KNN
plt.figure(figsize=(10, 6))
plt.plot(fpr_knn, tpr_knn, color='#1f77b4', lw=2, label='ROC curve (area = %0.2f)' % roc_auc_knn) # Blue color
plt.plot([0, 1], [0, 1], color='#ff7f0e', lw=2, linestyle='--') # Orange color
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) - KNN')
plt.legend(loc='lower right')
plt.show()
# Initialize the Logistic Regression model
log_reg = LogisticRegression(max_iter=10000, random_state=42)
# Train the model
log_reg.fit(X_train_scaled, y_train)
# Make predictions
y_pred_log_reg = log_reg.predict(X_test_scaled)
# Evaluate the Logistic Regression model
accuracy_log_reg = accuracy_score(y_test, y_pred_log_reg)
conf_matrix_log_reg = confusion_matrix(y_test, y_pred_log_reg)
class_report_log_reg = classification_report(y_test, y_pred_log_reg)
# Print evaluation metrics
print("Accuracy Score (Logistic Regression):", accuracy_log_reg)
print("\nConfusion Matrix (Logistic Regression):")
print(conf_matrix_log_reg)
print("\nClassification Report (Logistic Regression):")
print(class_report_log_reg)
Accuracy Score (Logistic Regression): 0.9736842105263158
Confusion Matrix (Logistic Regression):
[[70 1]
[ 2 41]]
Classification Report (Logistic Regression):
precision recall f1-score support
0 0.97 0.99 0.98 71
1 0.98 0.95 0.96 43
accuracy 0.97 114
macro avg 0.97 0.97 0.97 114
weighted avg 0.97 0.97 0.97 114
# Plotting the confusion matrix heatmap
plt.figure(figsize=(8, 6))
conf_matrix_log_reg = confusion_matrix(y_test, y_pred_log_reg)
sns.heatmap(conf_matrix_log_reg, annot=True, fmt='d', cmap='coolwarm',
xticklabels=['Class 0', 'Class 1'],
yticklabels=['Class 0', 'Class 1'])
plt.title('Confusion Matrix Heatmap (Logistic Regression)')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.show()
# Initialize the Random Forest model
random_forest = RandomForestClassifier(n_estimators=400, random_state=42)
# Train the model
random_forest.fit(X_train_scaled, y_train)
# Make predictions
y_pred_rf = random_forest.predict(X_test_scaled)
# Evaluate the Random Forest model
accuracy_rf = accuracy_score(y_test, y_pred_rf)
conf_matrix_rf = confusion_matrix(y_test, y_pred_rf)
class_report_rf = classification_report(y_test, y_pred_rf)
# Print evaluation metrics
print("Accuracy Score (Random Forest):", accuracy_rf)
print("\nConfusion Matrix (Random Forest):")
print(conf_matrix_rf)
print("\nClassification Report (Random Forest):")
print(class_report_rf)
Accuracy Score (Random Forest): 0.9649122807017544
Confusion Matrix (Random Forest):
[[70 1]
[ 3 40]]
Classification Report (Random Forest):
precision recall f1-score support
0 0.96 0.99 0.97 71
1 0.98 0.93 0.95 43
accuracy 0.96 114
macro avg 0.97 0.96 0.96 114
weighted avg 0.97 0.96 0.96 114
from sklearn.metrics import roc_curve, auc
# Compute ROC curve and AUC
fpr_rf, tpr_rf, _ = roc_curve(y_test, random_forest.predict_proba(X_test_scaled)[:, 1])
roc_auc_rf = auc(fpr_rf, tpr_rf)
# Plot ROC curve
plt.figure(figsize=(10, 6))
plt.plot(fpr_rf, tpr_rf, color='#1f77b4', lw=2, label='ROC curve (area = %0.2f)' % roc_auc_rf) # Blue color
plt.plot([0, 1], [0, 1], color='#ff7f0e', lw=2, linestyle='--') # Orange color
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) - Random Forest')
plt.legend(loc='lower right')
plt.show()
from sklearn.metrics import precision_recall_curve
# Compute Precision-Recall curve and AUC
precision_rf, recall_rf, _ = precision_recall_curve(y_test, random_forest.predict_proba(X_test_scaled)[:, 1])
pr_auc_rf = auc(recall_rf, precision_rf)
# Plot Precision-Recall curve
plt.figure(figsize=(10, 6))
plt.plot(recall_rf, precision_rf, color='orange', lw=2, label='Precision-Recall curve (area = %0.2f)' % pr_auc_rf)
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve - Random Forest')
plt.legend(loc='best')
plt.show()
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
# Initialize the SVM model with an RBF kernel
svm_model_rbf = SVC(kernel='rbf', random_state=42)
# Train the model
svm_model_rbf.fit(X_train_scaled, y_train)
# Make predictions
y_pred_svm_rbf = svm_model_rbf.predict(X_test_scaled)
# Evaluate the SVM model
accuracy_svm_rbf = accuracy_score(y_test, y_pred_svm_rbf)
conf_matrix_svm_rbf = confusion_matrix(y_test, y_pred_svm_rbf)
class_report_svm_rbf = classification_report(y_test, y_pred_svm_rbf)
# Print evaluation metrics
print("Accuracy Score (SVM with RBF kernel):", accuracy_svm_rbf)
print("\nConfusion Matrix (SVM with RBF kernel):")
print(conf_matrix_svm_rbf)
print("\nClassification Report (SVM with RBF kernel):")
print(class_report_svm_rbf)
Accuracy Score (SVM with RBF kernel): 0.9824561403508771
Confusion Matrix (SVM with RBF kernel):
[[71 0]
[ 2 41]]
Classification Report (SVM with RBF kernel):
precision recall f1-score support
0 0.97 1.00 0.99 71
1 1.00 0.95 0.98 43
accuracy 0.98 114
macro avg 0.99 0.98 0.98 114
weighted avg 0.98 0.98 0.98 114
import seaborn as sns
import matplotlib.pyplot as plt
# Plotting the confusion matrix heatmap
plt.figure(figsize=(8, 6))
conf_matrix_svm_rbf = confusion_matrix(y_test, y_pred_svm_rbf)
sns.heatmap(conf_matrix_svm_rbf, annot=True, fmt='d', cmap='coolwarm',
xticklabels=['Class 0', 'Class 1'],
yticklabels=['Class 0', 'Class 1'])
plt.title('Confusion Matrix Heatmap (SVM with RBF Kernel)')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.show()
from sklearn.metrics import roc_curve, auc
# Compute ROC curve and AUC for SVM with RBF kernel
fpr_svm_rbf, tpr_svm_rbf, _ = roc_curve(y_test, svm_model_rbf.decision_function(X_test_scaled))
roc_auc_svm_rbf = auc(fpr_svm_rbf, tpr_svm_rbf)
# Plot ROC curve for SVM with RBF kernel
plt.figure(figsize=(10, 6))
plt.plot(fpr_svm_rbf, tpr_svm_rbf, color='#1f77b4', lw=2, label='ROC curve (area = %0.2f)' % roc_auc_svm_rbf) # Blue color
plt.plot([0, 1], [0, 1], color='#ff7f0e', lw=2, linestyle='--') # Orange color
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) - SVM with RBF Kernel')
plt.legend(loc='lower right')
plt.show()
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
# Initialize the Gradient Boosting model
gb_model = GradientBoostingClassifier(random_state=42)
# Train the model
gb_model.fit(X_train_scaled, y_train)
# Make predictions
y_pred_gb = gb_model.predict(X_test_scaled)
# Evaluate the Gradient Boosting model
accuracy_gb = accuracy_score(y_test, y_pred_gb)
conf_matrix_gb = confusion_matrix(y_test, y_pred_gb)
class_report_gb = classification_report(y_test, y_pred_gb)
# Print evaluation metrics
print("Accuracy Score (Gradient Boosting):", accuracy_gb)
print("\nConfusion Matrix (Gradient Boosting):")
print(conf_matrix_gb)
print("\nClassification Report (Gradient Boosting):")
print(class_report_gb)
Accuracy Score (Gradient Boosting): 0.956140350877193
Confusion Matrix (Gradient Boosting):
[[69 2]
[ 3 40]]
Classification Report (Gradient Boosting):
precision recall f1-score support
0 0.96 0.97 0.97 71
1 0.95 0.93 0.94 43
accuracy 0.96 114
macro avg 0.96 0.95 0.95 114
weighted avg 0.96 0.96 0.96 114
import seaborn as sns
import matplotlib.pyplot as plt
# Plotting the confusion matrix heatmap
plt.figure(figsize=(8, 6))
conf_matrix_gb = confusion_matrix(y_test, y_pred_gb)
sns.heatmap(conf_matrix_gb, annot=True, fmt='d', cmap='coolwarm',
xticklabels=['Class 0', 'Class 1'],
yticklabels=['Class 0', 'Class 1'])
plt.title('Confusion Matrix Heatmap (Gradient Boosting)')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.show()
from sklearn.metrics import roc_curve, auc
# Compute ROC curve and AUC for Gradient Boosting
fpr_gb, tpr_gb, _ = roc_curve(y_test, gb_model.predict_proba(X_test_scaled)[:, 1])
roc_auc_gb = auc(fpr_gb, tpr_gb)
# Plot ROC curve for Gradient Boosting
plt.figure(figsize=(10, 6))
plt.plot(fpr_gb, tpr_gb, color='#1f77b4', lw=2, label='ROC curve (area = %0.2f)' % roc_auc_gb) # Blue color
plt.plot([0, 1], [0, 1], color='#ff7f0e', lw=2, linestyle='--') # Orange color
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) - Gradient Boosting')
plt.legend(loc='lower right')
plt.show()
print("Accuracy Score:", accuracy)
Accuracy Score: 0.9649122807017544
print("Accuracy Score (Logistic Regression):", accuracy_log_reg)
Accuracy Score (Logistic Regression): 0.9736842105263158
print("Accuracy Score (Random Forest):", accuracy_rf)
Accuracy Score (Random Forest): 0.9649122807017544
print("Accuracy Score (SVM with RBF kernel):", accuracy_svm_rbf)
Accuracy Score (SVM with RBF kernel): 0.9824561403508771
print("Accuracy Score (Gradient Boosting):", accuracy_gb)
Accuracy Score (Gradient Boosting): 0.956140350877193
# Accuracy scores
models = [
"KNN",
"Logistic Regression",
"Random Forest",
"SVM with RBF Kernel",
"Gradient Boosting"
]
accuracies = [
accuracy,
accuracy_log_reg,
accuracy_rf,
accuracy_svm_rbf,
accuracy_gb
]
# Sort the models and accuracies by accuracy score in ascending order
sorted_indices = np.argsort(accuracies)
sorted_models = np.array(models)[sorted_indices]
sorted_accuracies = np.array(accuracies)[sorted_indices]
# Create the horizontal bar plot
plt.figure(figsize=(10, 6))
bars = plt.barh(sorted_models, sorted_accuracies, color=['#E69F00', '#56B4E9', '#009E73', '#F0E442', '#0072B2'])
# Add percentage labels
for bar in bars:
plt.text(
bar.get_width() + 0.02, # X position of the text
bar.get_y() + bar.get_height() / 2, # Y position of the text
f"{bar.get_width()*100:.2f}%", # Label as percentage
va='center', # Vertical alignment of the text
ha='left' # Horizontal alignment of the text
)
plt.xlabel('Accuracy Score')
plt.ylabel('Models')
plt.title('Accuracy Scores of Different Models with Percentages (Sorted)')
plt.xlim([0, 1]) # Accuracy scores range from 0 to 1
plt.show()
This project aimed to enhance breast cancer diagnostics by applying various machine learning techniques to predict the malignancy of breast cancer cases based on Fine Needle Aspiration (FNA) image data. By evaluating five different classification models—Logistic Regression, K-Nearest Neighbors (KNN), Random Forests, Support Vector Machines (SVM) with the RBF kernel, and Gradient Boosting—the project sought to identify the most effective approach for improving diagnostic accuracy and aiding healthcare professionals in making more precise and timely diagnoses.
Among the models tested, the SVM with RBF kernel achieved the highest accuracy, 0.9825 (98.25%), demonstrating the strongest performance in distinguishing malignant from benign cases. Logistic Regression followed closely at 0.9737, with KNN and Random Forest both reaching 0.9649. Gradient Boosting, though slightly less accurate at 0.9561, still performed competitively.
These findings underscore the potential of machine learning algorithms in advancing breast cancer diagnostics. The SVM with RBF kernel, in particular, shows promise as a tool for enhancing diagnostic accuracy and aiding in early detection. By integrating such machine learning models into clinical practice, there is potential for improved patient outcomes and a significant contribution to the ongoing research in cancer diagnosis.
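All of these scores come from a single 114-sample test split and therefore carry sampling variance; k-fold cross-validation would give a more stable estimate. A sketch for the best-performing model, using scikit-learn's bundled copy of the WDBC data and a Pipeline so the scaler is re-fit inside each fold (avoiding leakage from test folds into the scaling):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Scaling + SVM bundled together so every CV fold is scaled independently
pipe = make_pipeline(StandardScaler(), SVC(kernel='rbf'))
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.round(4), scores.mean().round(4))
```

If the cross-validated mean lands near the single-split score, that supports reading the ~98% figure as genuine model performance rather than an artifact of one particular split.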